Supervised Classification Using Balanced Training
نویسندگان
چکیده
We examine supervised learning for multi-class, multi-label text classification. We are interested in exploring classification in a realworld setting, where the distribution of labels may change dynamically over time. First, we compare the performance of an array of binary classifiers trained on the label distribution found in the original corpus against classifiers trained on balanced data, where we try to make the label distribution as nearly uniform as possible. We discuss the performance tradeoffs between balanced vs. unbalanced training, and highlight the advantages of balancing the training set. Second, we compare the performance of two classifiers, Naive Bayes and SVM, with several feature-selection methods, using balanced training. We combine a Named-Entity-based rote classifier with the statistical classifiers to obtain better performance than either method alone.
منابع مشابه
Using statistical text classification to identify health information technology incidents
OBJECTIVE To examine the feasibility of using statistical text classification to automatically identify health information technology (HIT) incidents in the USA Food and Drug Administration (FDA) Manufacturer and User Facility Device Experience (MAUDE) database. DESIGN We used a subset of 570 272 incidents including 1534 HIT incidents reported to MAUDE between 1 January 2008 and 1 July 2010. ...
متن کاملSemi-supervised Learning from Unbalanced Labeled Data - An Improvement
We present a great improvement while performing semi-supervised learning tasks from training data sets when only a small fraction of the data pairs is labeled. In particular, we propose a novel decision strategy based on normalized model outputs. We give the explanation why the normalization step helps. The paper compares performances of two popular semi-supervised approaches (Consistency Metho...
متن کاملBootstrapping Supervised Machine-learning Polarity Classifiers with Rule-based Classification
In this paper, we explore the effectiveness of bootstrapping supervised machine-learning polarity classifiers using the output of domain-independent rule-based classifiers. The benefit of this method is that no labeled training data are required. Still, this method allows to capture in-domain knowledge by training the supervised classifier on in-domain features, such as bag of words. We investi...
متن کاملExploiting Unlabeled Texts with Clustering-based Instance Selection for Medical Relation Classification
Classifying relations between pairs of medical concepts in clinical texts is a crucial task to acquire empirical evidence relevant to patient care. Due to limited labeled data and extremely unbalanced class distributions, medical relation classification systems struggle to achieve good performance on less common relation types, which capture valuable information that is important to identify. O...
متن کاملClassification of ECG signals using Hermite functions and MLP neural networks
Classification of heart arrhythmia is an important step in developing devices for monitoring the health of individuals. This paper proposes a three module system for classification of electrocardiogram (ECG) beats. These modules are: denoising module, feature extraction module and a classification module. In the first module the stationary wavelet transform (SWF) is used for noise reduction of ...
متن کامل